CHAPTER 1:

Introduction 


The  Department  of  Defense  (DOD)  has  identified  collaboration  and  improved  access  to 
information  as  key  elements  of  future  operational  success.  This  need,  coupled  with  the 
massive  growth  in  data,  has  led  the  U.S.  Navy  and  other  DOD  entities  to  invest  in  cloud 
storage  capabilities  in  an  effort  to  cope  with  “Big  Data.”  In  2014,  Terry  Halvorsen,  then- 
Navy  Chief  Information  Officer  (CIO),  stated  the  Navy  needs  to  move  about  half  of  its 
unclassified  data  into  commercial  cloud  storage  [1].  In  late  2014,  as  acting  DOD  CIO, 
Halvorsen  released  a  memo  freeing  DOD  agencies  to  procure  their  own  commercial  cloud 
services,  without  using  Defense  Information  Systems  Agency  (DISA),  in  an  effort  to  speed 
up  the  migration  process  [2]. 

With the growing use of cloud storage solutions, there is a corresponding need for secure
and efficient means of guaranteeing data integrity and availability. The Federal Cloud
Computing Strategy of 2011 states that agencies should explicitly state security,
availability, and quality requirements through service level agreements, and routinely
monitor vendor compliance [3]. The DOD Cloud Computing Strategy of 2012 also establishes
the requirement for cloud services to provide sufficient security to ensure the integrity
and availability of DOD information [4]. In 2015, the DOD released its Cloud Computing
Security Requirements Guide (SRG), which outlines the security requirements for DOD
agencies procuring commercial cloud services. Among its recommendations are policies that
would provide audit and accountability for data additions, deletions, and modifications [5].
Recent outages at well-known cloud storage providers, including Amazon S3 and Microsoft
Azure, also underscore the need for a reliable and efficient auditing mechanism to ensure
data availability and integrity as agencies migrate to the commercial cloud [6]-[8].

Proof of data possession (PDP) schemes may provide the best mechanism to fulfill these
demands to actively track vendor compliance and assure the integrity of data in storage.
Through the use of cryptographic protocols, PDP schemes provide probabilistic guarantees
that data on storage servers has not been maliciously or inadvertently deleted or altered.
They claim to provide this guarantee at low cost to both the proving and verifying
entities, with asymptotic costs that are strictly sublinear in file size. This technology
has not yet been implemented by or for a commercial service; however, there has been
substantial research in PDP and other data integrity schemes over the past decade [9]-[32].

While  multiple  PDP  schemes  have  been  proposed,  each  with  varying  degrees  of  efficiency 
and  security,  there  is  no  research  to  date  that  provides  in-depth  cost  analysis  comparing  the 
real-world  efficiencies  of  PDP  schemes.  All  prior  research  has  focused  on  two  aspects  of 
PDP  schemes:  providing  high  probability  guarantees  of  data  possession  (security)  while 
minimizing  the  size  of  the  challenge  and  response  (communication  complexity).  These  are 
important  criteria,  especially  in  bandwidth-constrained  environments;  however,  to  date,  no 
research  has  provided  comparisons  of  PDP  schemes  in  terms  of  real-world  costs  (time  to 
generate  proof,  time  to  verify  proof,  time  to  tag,  cost  to  store  tag  overhead,  cost  to  run  an 
audit service, cost to service requests from an audit service, etc.).

Our  research  fills  that  gap  by  (1)  collecting  and  analyzing  cost  data  for  four  PDP  schemes, 
(2)  providing  generic  cost  models  (mathematical  formulae  expressing  abstract  models 
which  can  be  used  to  infer  future  cost),  and  (3)  comparing  overall  cost  efficiency  of  each 
PDP scheme. Additionally, instead of measuring costs primarily in terms of the size of the
query and response (a bandwidth concern), this research recognizes (a) the importance
of processing time when evaluating the cost of a particular scheme, and (b) the asymmetric
costs associated with some cloud cost models (e.g., PUTs are typically more expensive
than GETs).

Based  on  our  generic  cost  models,  we  show  that  the  basis  costs  to  audit  are  nearly  identical 
for  MAC-PDP,  A-PDP,  and  CPOR,  but  tag  and  storage  costs  are  different  enough  to  have 
a significant impact on total cost among the schemes. We also show that the total costs
of MAC-PDP and CPOR are similar, but A-PDP becomes expensive relative to the other
schemes at large file sizes, due to its higher tag and storage costs. We show that the total
basis  cost  (up-front  cost  to  tag  and  cumulative  cost  storing  and  auditing)  for  one  year  at  one 
audit  per  hour  of  a  1  GB  file  is  under  $1  for  MAC-PDP,  A-PDP,  and  CPOR,  but  that  cost 
ranges  from  $4,400  to  $38,700  across  schemes  for  a  1  PB  file. 
CHAPTER 2:
Background 


This research focuses on four specific PDP schemes: a simple MAC-based PDP scheme
(MAC-PDP), the scheme described by Ateniese, Burns, Curtmola, Herring, Kissner, Peterson
and Song (A-PDP) [33], the scheme described by Ateniese, Pietro, Mancini and
Tsudik (SEPDP) [34] and the scheme described by Shacham and Waters (CPOR) [35].
Our research is primarily concerned with building accurate cost models for each scheme
based on experimental audit data. Below, we provide a generic description of PDP and a
description of each PDP scheme considered in our experiments.


2.1  Proof  of  Data  Possession 

A PDP system can be divided into two generic phases: the set-up phase and the challenge
phase. In the set-up phase, a client generates a public and private key pair, tags the file,
uploads the file and tag data to storage, and deletes the file from local storage. During
the challenge phase, the client generates a challenge for a specified number of file blocks
and sends the challenge to the prover. The prover uses the challenge to generate a proof of
possession, which is returned to the client. The client then validates the proof, obtaining
a probabilistic guarantee that the prover does or does not possess the client's file.

Following the notation of Juels and Kaliski [36] and Bowers, Juels, and Oprea [37], a file
$M$ can be divided into $n$ blocks, $M = (m_1, m_2, \ldots, m_n)$. We let $P$ denote the
prover (server), $V$ denote the verifier (client), $\eta$ denote the file's identifier, and
$\omega$ denote local client state. We represent unspecified values with a $\perp$ symbol.
A generic PDP scheme can be considered a five-tuple of algorithms, (KeyGen, Tag, Challenge,
Proof, Verify), each described as follows.

KeyGen$(1^k) \to (pk, sk)$. This algorithm is used by the client to generate random public
and private keys by employing security parameter $k$.

Tag$(M; pk, sk, \omega) \to M^*$. This algorithm is used by the client to process a file and
produce verification tag data. It takes as input a public and private key pair $(pk, sk)$
and file $M$. It generates a file ID $\eta$ and returns $M^*$, the encoded file with
verification tag data. It also updates the client state $\omega$ to include any locally
held data such as the file ID, file size, number of blocks, etc. The data $M^*$ can be
stored remotely.

Challenge$(\eta; pk, sk, \omega) \to c$. This algorithm is used by the client to produce a
challenge $c$. This challenge will be sent to the prover during an audit.

Proof$(\eta, M^*, c; pk) \to \rho$. This algorithm is used by the prover to demonstrate
proof of possession of specified file blocks as a response to challenge $c$. It takes as
input the remote, encoded data $M^*$ and challenge $c$, to generate proof $\rho$.

Verify$(c, \rho, \eta; pk, sk, \omega) \to b \in \{0, 1\}$. This algorithm is used by the
client to validate the proof $\rho$. It takes as input the public and private key pair
$(pk, sk)$, challenge $c$ and proof $\rho$. Upon successful validation it returns 1, else
it returns 0.


2.2  Constructions 

In  this  section,  we  provide  detailed  descriptions  of  each  PDP  scheme  employed  in  our 
study:  MAC-PDP,  A-PDP,  SEPDP  and  CPOR. 


2.2.1  MAC-PDP 

The  MAC-PDP  scheme  is  defined  below,  following  the  description  and  notation  from 
Shacham  and  Waters  [35]  and  Riebel  [38],  adapted  slightly  for  uniformity  with  the  other 
schemes  in  Section  2.2. 

Let $f$ be a keyed pseudo-random function, as follows:

$$f : \{0,1\}^* \times \mathcal{K}_{prf} \to \mathbb{Z}_p$$

KeyGen$(1^k) \to (pk, sk)$. Choose a random secret key for a hash-based MAC function,
$k_{mac} \leftarrow \mathcal{K}_{prf}$. The secret key is $sk = (k_{mac})$ and the public
key is $pk = \perp$.

Tag$(M; pk, sk, \omega) \to M^*$. The file is split into $n$ blocks,
$M = (m_1, m_2, \ldots, m_n)$. Choose a random file ID $\eta$, where
$\eta \in \mathbb{Z}_p$. For each block $m_i$ ($1 \le i \le n$), generate tag
$\sigma_i = \mathrm{MAC}_{k_{mac}}(\eta \| m_i)$. The data stored remotely is
$M^* = (M, \{\sigma_i\}_{1 \le i \le n})$.

Challenge$(\eta; pk, sk, \omega) \to c$. Choose a random $\ell$-element subset
$I \subseteq [1, n]$ of indices. Let $c$ be the set $\{i\}_{i \in I}$.

Proof$(\eta, M^*, c; pk) \to \rho$. For each $i \in c$, return to the verifier
$\rho = \{(m_i, \sigma_i)\}_{i \in c}$.

Verify$(c, \rho, \eta; pk, sk, \omega) \to b \in \{0, 1\}$. For each $i \in c$, check whether
$\sigma_i \stackrel{?}{=} \mathrm{MAC}_{k_{mac}}(\eta \| m_i)$. If all $\ell$ checks are
correct then return $b = 1$, else return $b = 0$.
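The MAC-PDP algorithms above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the libpdp implementation: HMAC-SHA1 is assumed as the hash-based MAC (it yields 20-byte tags), the file ID is treated as an opaque byte string, and all function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def split_blocks(data: bytes, bs: int = 4096) -> list:
    """Split file M into blocks m_1..m_n of bs bytes (the last may be shorter)."""
    return [data[i:i + bs] for i in range(0, len(data), bs)]


def _mac(k_mac: bytes, file_id: bytes, block: bytes) -> bytes:
    # sigma_i = MAC_{k_mac}(eta || m_i); HMAC-SHA1 is an assumed choice of MAC.
    return hmac.new(k_mac, file_id + block, hashlib.sha1).digest()


def tag(k_mac: bytes, file_id: bytes, blocks: list) -> list:
    """Tag: one MAC per block."""
    return [_mac(k_mac, file_id, m) for m in blocks]


def challenge(n: int, ell: int) -> list:
    """Challenge: a random subset of min(ell, n) distinct block indices."""
    return secrets.SystemRandom().sample(range(n), min(ell, n))


def proof(blocks: list, tags: list, c: list) -> list:
    """Proof: the prover returns the challenged (block, tag) pairs."""
    return [(blocks[i], tags[i]) for i in c]


def verify(k_mac: bytes, file_id: bytes, c: list, rho: list) -> bool:
    """Verify: recompute each MAC and compare in constant time."""
    if len(rho) != len(c):
        return False
    return all(hmac.compare_digest(sigma, _mac(k_mac, file_id, m))
               for m, sigma in rho)
```

Because the proof carries the challenged blocks themselves, its size grows with both the number of challenged blocks and the block size, which is the communication cost the other schemes try to avoid.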


2.2.2  A-PDP 

The  A-PDP  scheme  is  defined  below,  following  the  description  and  notation  from  Ateniese 
et  al.  [33],  adapted  slightly  for  uniformity  with  the  other  schemes  in  Section  2.2. 

Let $H$ be a cryptographic hash function, $h$ be a full-domain hash function, $f$ be a
pseudo-random function and $\pi$ be a pseudo-random permutation (PRP) as follows (where
$k$, $\ell$, $\lambda$ are security parameters):

$$h : \{0,1\}^* \to QR_N \quad \text{($QR_N$ is the set of quadratic residues modulo $N$)}$$
$$f : \{0,1\}^k \times \{0,1\}^{\log_2(n)} \to \{0,1\}^{\ell}$$
$$\pi : \{0,1\}^k \times \{0,1\}^{\log_2(n)} \to \{0,1\}^{\log_2(n)}$$

KeyGen$(1^k) \to (pk, sk)$. Choose safe primes $p, q$, where $p = 2p' + 1$ and
$q = 2q' + 1$. Let $N = pq$. Let $g$ be a generator of $QR_N$, the set of quadratic
residues modulo $N$. Let $v \leftarrow \{0,1\}^k$. The public key is $pk = (N, g)$ and the
secret key is $sk = (e, d, v)$, such that $e$ is a large secret prime with
$ed \equiv 1 \pmod{p'q'}$, $e > \lambda$, $d > \lambda$.

Tag$(M; pk, sk, \omega) \to M^*$. The file is split into $n$ blocks,
$M = (m_1, m_2, \ldots, m_n)$. For each block $m_i$, compute
$T_{i,m_i} = (h(W_i) \cdot g^{m_i})^d \bmod N$, where $W_i = v \| i$. The data stored
remotely is $M^* = (M, \{(T_{i,m_i}, W_i)\}_{1 \le i \le n})$.

Challenge$(\eta; pk, sk, \omega) \to c$. To audit $\ell$ blocks of $M$, generate challenge
$c = (\ell, k_1, k_2, g_s)$, where $k_1$ and $k_2$ are random $k$-bit keys, and
$g_s = g^s \bmod N$ for random $s$.

Proof$(\eta, M^*, c; pk) \to \mathcal{P}$. For $1 \le j \le \ell$, generate indices
$i_j = \pi_{k_1}(j)$ and coefficients $a_j = f_{k_2}(j)$. Compute

$$T = T_{i_1,m_{i_1}}^{a_1} \cdots T_{i_\ell,m_{i_\ell}}^{a_\ell}
    = \left( h(W_{i_1})^{a_1} \cdots h(W_{i_\ell})^{a_\ell} \cdot
      g^{a_1 m_{i_1} + \cdots + a_\ell m_{i_\ell}} \right)^d \bmod N.$$

Compute $\rho = H(g_s^{a_1 m_{i_1} + \cdots + a_\ell m_{i_\ell}} \bmod N)$. The proof is
$\mathcal{P} = (T, \rho)$.

Verify$(c, \mathcal{P}, \eta; pk, sk, \omega) \to b \in \{0, 1\}$. Let $\tau = T^e$. For
$1 \le j \le \ell$, compute $i_j = \pi_{k_1}(j)$, $W_{i_j} = v \| i_j$, $a_j = f_{k_2}(j)$,
and $\tau = \tau / h(W_{i_j})^{a_j} \bmod N$. If $H(\tau^s \bmod N) = \rho$ then return
$b = 1$, else return $b = 0$.
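To see why the A-PDP verification equation holds, the scheme can be exercised with toy numbers. The sketch below is illustrative only and rests on several assumptions that are not in the original: the modulus is tiny (N = 23 * 47), blocks are small integers, HMAC-SHA256 stands in for the PRF f, a keyed sort stands in for the PRP, and the full-domain hash h is approximated by hashing and then squaring modulo N. All names are hypothetical.

```python
import hashlib
import hmac
import secrets
from math import gcd

# Toy parameters, far too small for real use (Table 3.1 uses a 1024-bit N).
P_PRIME, Q_PRIME = 11, 23                     # p', q'
N = (2 * P_PRIME + 1) * (2 * Q_PRIME + 1)     # N = 23 * 47 = 1081
ORDER = P_PRIME * Q_PRIME                     # |QR_N| = p'q' = 253
G = 4                                         # 2^2 mod N, generates QR_N here
E = 3                                         # toy exponent e
D = pow(E, -1, ORDER)                         # e*d = 1 (mod p'q')


def _mac(key: bytes, i: int) -> bytes:
    return hmac.new(key, i.to_bytes(4, "big"), hashlib.sha256).digest()


def h(w: bytes) -> int:
    """Stand-in full-domain hash into QR_N: hash, reduce, square a unit."""
    ctr = 0
    while True:
        x = int.from_bytes(hashlib.sha256(w + bytes([ctr])).digest(), "big") % N
        if gcd(x, N) == 1:
            return x * x % N
        ctr += 1


def W(v: bytes, i: int) -> bytes:
    return v + i.to_bytes(4, "big")           # W_i = v || i


def prf(k2: bytes, j: int) -> int:
    return int.from_bytes(_mac(k2, j)[:4], "big")   # a_j = f_{k2}(j)


def prp(k1: bytes, n: int) -> list:
    # Keyed shuffle of the indices 0..n-1, standing in for pi_{k1}.
    return sorted(range(n), key=lambda i: _mac(k1, i))


def tag(v: bytes, blocks: list) -> list:
    return [pow(h(W(v, i)) * pow(G, m, N) % N, D, N) for i, m in enumerate(blocks)]


def challenge(ell: int):
    s = secrets.randbelow(ORDER - 1) + 1      # secret s, kept by the verifier
    return (ell, secrets.token_bytes(16), secrets.token_bytes(16), pow(G, s, N)), s


def _digest(x: int) -> str:
    return hashlib.sha256(str(x).encode()).hexdigest()   # plays the role of H


def proof(blocks: list, tags: list, c):
    ell, k1, k2, g_s = c
    T, exponent = 1, 0
    for j, i in enumerate(prp(k1, len(blocks))[:ell]):
        a = prf(k2, j)
        T = T * pow(tags[i], a, N) % N
        exponent += a * blocks[i]
    return T, _digest(pow(g_s, exponent, N))


def verify(v: bytes, s: int, n: int, c, p) -> bool:
    ell, k1, k2, _ = c
    T, rho = p
    tau = pow(T, E, N)
    for j, i in enumerate(prp(k1, n)[:ell]):
        tau = tau * pow(h(W(v, i)), -prf(k2, j), N) % N   # divide out h(W)^a
    return _digest(pow(tau, s, N)) == rho
```

Note that the proof is a single group element plus a hash, regardless of how many blocks are challenged; the prover's work, however, includes one modular exponentiation per challenged block.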
2.2.3 CPOR

The  CPOR  scheme  is  defined  below,  following  the  description  and  notation  from  Shacham 
and  Waters  [35],  adapted  slightly  for  uniformity  with  the  other  schemes  in  Section  2.2. 

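Let $f$ be a keyed pseudo-random function, as follows:

$$f : \{0,1\}^* \times \mathcal{K}_{prf} \to \mathbb{Z}_p$$

KeyGen$(1^\lambda) \to (pk, sk)$. Choose a random key $k_{enc} \leftarrow \mathcal{K}_{enc}$
for symmetric encryption scheme Enc, and a random HMAC key
$k_{mac} \leftarrow \mathcal{K}_{mac}$. The secret key is $sk = (k_{enc}, k_{mac})$ and the
public key is $pk = \perp$.

Tag$(M; pk, sk, \omega) \to M^*$. Given the file $M$, split $M$ into $n$ blocks, each $s$
sectors long: $M = (m_{ij})_{1 \le i \le n,\, 1 \le j \le s}$. Choose a PRF key
$k_{prf} \leftarrow \mathcal{K}_{prf}$ and $s$ random numbers
$\alpha_1, \ldots, \alpha_s \leftarrow \mathbb{Z}_p$. Let
$\tau_0 = (n \| \mathrm{Enc}_{k_{enc}}(k_{prf} \| \alpha_1 \| \cdots \| \alpha_s))$. The
file tag is $\tau = (\tau_0 \| \mathrm{MAC}_{k_{mac}}(\tau_0))$. For each $i$,
$1 \le i \le n$, compute

$$\sigma_i \leftarrow f_{k_{prf}}(i) + \sum_{j=1}^{s} \alpha_j m_{ij}.$$

The data stored remotely is $M^* = (\{m_{ij}\}, \{\sigma_i\})$.

Challenge$(\eta; pk, sk, \omega) \to c$. Choose a random $\ell$-element subset
$I \subseteq [1, n]$. For each $i \in I$ choose random $\nu_i \leftarrow \mathbb{Z}_p$. Let
$c$ be the set $\{(i, \nu_i)\}_{i \in I}$.

Proof$(\eta, M^*, c; pk) \to \rho$. The prover parses $c$ as $\{(i, \nu_i)\}$ and computes

$$\mu_j \leftarrow \sum_{(i,\nu_i) \in c} \nu_i m_{ij} \ \text{ for } 1 \le j \le s,
  \quad \text{and} \quad \sigma \leftarrow \sum_{(i,\nu_i) \in c} \nu_i \sigma_i.$$

The proof is $\rho = (\{\mu_k\}_{1 \le k \le s}, \sigma)$.

Verify$(c, \rho, \eta; pk, sk, \omega) \to b \in \{0, 1\}$. Check whether

$$\sigma \stackrel{?}{=} \sum_{(i,\nu_i) \in c} \nu_i f_{k_{prf}}(i) + \sum_{j=1}^{s} \alpha_j \mu_j.$$

If equal then return $b = 1$, else return $b = 0$.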
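The linear structure of CPOR is easy to check numerically. The following sketch is a simplified illustration, not the libpdp code: it assumes the prime 2^61 - 1 in place of the 80-bit p of Table 3.1, uses HMAC-SHA256 reduced modulo p as the PRF, represents sectors as integers, and omits the encrypted file tag (the verifier is simply handed k_prf and the alpha values). Function names are hypothetical.

```python
import hashlib
import hmac
import secrets

P_MOD = 2**61 - 1   # a Mersenne prime, assumed here as a stand-in for p


def prf(k_prf: bytes, i: int) -> int:
    """f_{k_prf}(i): HMAC-SHA256 output reduced into Z_p (assumed PRF)."""
    digest = hmac.new(k_prf, i.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % P_MOD


def tag(k_prf: bytes, alphas: list, sectors: list) -> list:
    """sigma_i = f(i) + sum_j alpha_j * m_ij (mod p); sectors[i][j] is m_ij."""
    return [(prf(k_prf, i) + sum(a * m for a, m in zip(alphas, row))) % P_MOD
            for i, row in enumerate(sectors)]


def challenge(n: int, ell: int) -> list:
    """Random index subset I, each index paired with a random coefficient nu_i."""
    idx = secrets.SystemRandom().sample(range(n), min(ell, n))
    return [(i, secrets.randbelow(P_MOD)) for i in idx]


def proof(sectors: list, sigmas: list, c: list):
    """Aggregate the challenged sectors and tags into s values mu_j and one sigma."""
    s = len(sectors[0])
    mu = [sum(nu * sectors[i][j] for i, nu in c) % P_MOD for j in range(s)]
    sigma = sum(nu * sigmas[i] for i, nu in c) % P_MOD
    return mu, sigma


def verify(k_prf: bytes, alphas: list, c: list, rho) -> bool:
    mu, sigma = rho
    expected = (sum(nu * prf(k_prf, i) for i, nu in c)
                + sum(a * m for a, m in zip(alphas, mu))) % P_MOD
    return sigma == expected
```

The proof here consists of s + 1 field elements however many blocks are challenged, which is why sector size (and therefore s) matters for both tag cost and proof size.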


2.2.4  SEPDP 

The SEPDP scheme is defined below, following the description and notation from Ateniese
et al. [34], adapted slightly for uniformity with the other schemes in Section 2.2.

Let $t$ be the number of possible challenges, $H$ be a cryptographic hash function, AE be
an authenticated encryption scheme, $f$ be a keyed pseudo-random function and $\pi$ be a
keyed pseudo-random permutation, defined as follows:

$$H : \{0,1\}^* \to \{0,1\}^d$$
$$f : \{0,1\}^k \times \{0,1\}^{\log(t)} \to \{0,1\}^L$$
$$\pi : \{0,1\}^L \times \{0,1\}^{\log(n)} \to \{0,1\}^{\log(n)}$$

KeyGen$(1^\lambda) \to (pk, sk)$. Choose secret permutation key $W \leftarrow \{0,1\}^k$,
master challenge nonce key $Z \leftarrow \{0,1\}^k$ and master encryption key
$K \leftarrow \{0,1\}^k$. The secret key is $sk = (W, Z, K)$. The public key is
$pk = \perp$.

Tag$(M; pk, sk, \omega) \to M^*$. Divide message $M$ into $n$ blocks. Choose the number $t$
of possible random challenges and the number $\ell$ of block indices per verification. For
each $1 \le i \le t$, generate the $i$-th tag as follows:

  Generate a permutation key $k_i = f_W(i)$ and nonce $c_i = f_Z(i)$.

  Compute the set of indices $\{i_j \in [1, n] \mid 1 \le j \le \ell\}$, where
  $i_j = \pi_{k_i}(j)$.

  Compute token $v_i = H(c_i, m_{i_1}, \ldots, m_{i_\ell})$.

  Encrypt the token: $\sigma_i \leftarrow \mathrm{AE}_K(i, v_i)$.

The data stored remotely is $M^* = (M, \{\sigma_i\}_{1 \le i \le t})$.

Challenge$(\eta; pk, sk, \omega) \to c$. Generate the $i$-th challenge $c = (k_i, c_i)$ by
recomputing $k_i = f_W(i)$ and $c_i = f_Z(i)$.

Proof$(\eta, M^*, c; pk) \to \rho$. Compute $z = H(c_i, m_{i_1}, \ldots, m_{i_\ell})$, where
$i_j = \pi_{k_i}(j)$. The proof is $\rho = (z, \sigma_i)$.

Verify$(c, \rho, \eta; pk, sk, \omega) \to b \in \{0, 1\}$. Compute
$v = \mathrm{AE}_K^{-1}(\sigma_i)$. If $v \stackrel{?}{=} (i, z)$ then return $b = 1$, else
return $b = 0$.


2.3  Cost  Complexity 

The asymptotic communication complexity for each target PDP scheme is summarized
in Table 2.1. While MAC-PDP affords a simple implementation, it is criticized for its
relatively large communication complexity. Schemes like A-PDP, CPOR and SEPDP are
designed with the goal of minimizing communication complexity [33]-[35].

Table 2.1: Asymptotic communication complexity of MAC-PDP, A-PDP, CPOR and SEPDP.

Scheme     Challenge                              Proof
MAC-PDP    $O(\ell \log(n))$                      $O(\ell (bs + k))$
A-PDP      $O(\log(\ell) + 2k + \log(N))$         $O(\log(N))$
CPOR       $O(\ell + (\log(n) + d))$              $O(\log(p))$
SEPDP      $O(L)$                                 $O(d + L)$

The block size $bs$ is a function of file size and $n$, the number of file blocks.


2.4  Detection  Probability 

It is not the objective of this study to compare proofs associated with PDP schemes.
Comparing the cost of each scheme does, however, require selecting comparable parameters.
There are at least three senses in which PDP schemes might be considered to be
comparable.

Strength of Security. For a scheme, this is expressed as Pr[forge], the probability that a
prover can get the verifier to accept a forged proof as valid (i.e., when it was
computed without using some blocks involved in the challenge).

Strength of Audit. For a scheme, this is expressed as Pr[audit], the probability that a
single audit will appear to succeed even when $k$ of $n$ blocks have been deleted. For
many schemes, this is a combinatorial argument based on the probability that none of the
$\ell$ random challenge indices falls among the $k$ blocks deleted.

Efficiency of Recovery. Some PDP schemes, often called proof of retrievability (POR)
schemes, have the additional characteristic that the original file can be recovered
even after some number of failed audits. For such a scheme, this is expressed as
Pr[recover], the probability of retrieval after an $\epsilon$ fraction of audits have failed.
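The combinatorial argument behind Pr[audit] can be made concrete. The sketch below assumes the common model in which the challenge picks distinct block indices uniformly at random, so a single audit misses all deleted blocks with probability C(n - k, l) / C(n, l); the function name and the example figures are illustrative, not taken from our experiments.

```python
from math import comb


def pr_audit(n: int, k: int, ell: int) -> float:
    """Probability that one audit of ell distinct random blocks (out of n)
    touches none of the k deleted blocks, i.e. the audit appears to succeed."""
    ell = min(ell, n)
    return comb(n - k, ell) / comb(n, ell)


if __name__ == "__main__":
    # Example: a 2^25-byte file with 4096-byte blocks has 8192 blocks.
    # With roughly 1% of them deleted and 460 challenged blocks, the
    # chance that a single audit notices nothing is below 1%.
    print(pr_audit(8192, 82, 460))
```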

Comparison across schemes in these senses is problematic for a number of reasons: (i)
schemes rely on different primitives (full-domain hash functions, authenticated encryption
schemes, pseudorandom permutations), making parameter selection to achieve comparable
Pr[forge] difficult; (ii) schemes have expressed these properties in slightly different
adversarial models, employing slightly different arguments; (iii) arguments have been
expressed in asymptotic terms rather than concrete terms, making parameter derivation
difficult, especially when arguments employ bounds that are known not to be tight. Thus we
do not select parameters in this study with the objective of providing absolute
apples-to-apples comparison across schemes. Since the simple combinatorial arguments
employed for Pr[audit] tend to be most reusable, we prioritize parameter selection for
comparability in this sense. In some sense, this is a rather insignificant parameter, since
its probability can be driven arbitrarily low through repeated audits: the chance of passing
every audit in a series falls exponentially with the number of audits. At the same time,
selection of this parameter may be most directly related to deriving policy on how often
one performs audits. As we are interested in the recurring cost of audit, it is a natural
parameter of our study to consider carefully. We leave open for future work parameter
selection to facilitate fair comparison in terms of Pr[forge] and Pr[recover].
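The remark about repeated audits can be written out as a short worked relation. It assumes that successive audits use independent random challenges and that the set of deleted blocks does not change between audits; the symbols $r$ and $\delta$ are introduced here only for illustration.

```latex
% Probability that r independent audits all appear to succeed
\Pr[\text{all } r \text{ audits pass}] = \Pr[\text{audit}]^{\,r}

% Number of audits needed to push that probability below a target \delta
r \;\ge\; \frac{\ln \delta}{\ln \Pr[\text{audit}]}

% Example: \Pr[\text{audit}] = 10^{-2},\ \delta = 10^{-6}
% gives r \ge \ln(10^{-6}) / \ln(10^{-2}) = 3 audits.
```

Under these assumptions, the audit frequency chosen by policy determines how quickly undetected deletion becomes implausible, which is why the per-audit cost dominates the recurring cost considered in this study.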
CHAPTER 3:


Methodology 


This  chapter  discusses  our  experimental  environment,  methodology  for  how  timing  data  is 
gathered,  and  implementation  decisions  for  evaluating  the  performance  of  the  PDP  schemes 
under  evaluation. 


3.1  Experiment  Environment 

Our  PDP  experiments  can  be  divided  into  two  phases:  a  set-up  phase  and  an  audit  phase.  In 
the  set-up  phase,  the  client  generates  keys  ( pk,sk ),  generates  a  tagged  file  M*,  and  sends 
M*  to  remote  storage  (see  Figure  3.1a).  For  the  audit  phase,  the  client  generates  a  challenge 
c  and  sends  it  to  the  prover;  the  prover  responds  with  a  proof  p,  which  is  sent  to  the  client; 
the  client  verifies  the  proof  and  indicates  success  or  failure  (see  Figure  3.1b). 


[Figure 3.1, image not reproduced. Panel (a): set-up phase of PDP protocol. Panel (b): audit
phase of PDP protocol, in which the auditor generates challenge c and the prover responds
with proof p. Both panels show client storage and remote storage.]

Figure 3.1: Set-up and audit phases of PDP experiment.

Adapted from [33]: G. Ateniese, R. Burns, R. Curtmola, J. Herring, L. Kissner, Z. Peterson,
and D. Song, "Provable data possession at untrusted stores," in Proceedings of the
14th ACM Conference on Computer and Communications Security, 2007, pp. 598-609.
3.2 Measurements and Costs

It  is  important  to  define  what  system  costs  are  measured  in  each  of  our  experiments.  We 
depict  what  operations  are  included  in  each  of  our  measurements  in  Figure  3.2.  Generally, 
we  ignore  costs  associated  with  transfer  time  and  service  latency,  focusing  on  significant, 
recurring  computational  costs. 


[Figure 3.2, image not reproduced. The diagram spans the set-up phase and the audit phase
for the parties labeled C, P and S.]

Figure 3.2: Timing measurement definitions, highlighting what operations
and costs are included in each measurement.


In the set-up phase we do not measure the cost of generating keys (pk, sk). When tagging
data, we ignore the cost of sending the file and tag data M* to the storage server S. In
the audit phase, we ignore the transfer time involved in sending the challenge to prover P
and in returning the proof to client C. For proof generation, however, we include the time
associated with retrieving challenge blocks from local or remote storage as part of the
proof time. We believe the cost of parsing the challenge, the cost of retrieving the data
required for the proof, and the cost of generating the proof itself are intimately
related, and we combine these in our measurement.
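The measurement boundary described above, with block retrieval counted inside the proof time, can be sketched as follows. This is an illustrative harness only: our benchmark is written in C against libpdp, and the names `timed`, `proof_with_retrieval`, `fetch_block` and `make_proof` are hypothetical.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def proof_with_retrieval(fetch_block, indices, make_proof):
    """Retrieve the challenged blocks and build the proof in one step, so that
    a timer wrapped around this call includes the retrieval cost."""
    blocks = [fetch_block(i) for i in indices]   # local read or remote GET
    return make_proof(blocks)


if __name__ == "__main__":
    store = {i: bytes([i]) * 4096 for i in range(8)}    # stand-in for storage
    rho, seconds = timed(proof_with_retrieval, store.__getitem__, [1, 5],
                         lambda blocks: b"".join(blocks))
    print(len(rho), seconds)
```

Swapping `fetch_block` between a local read and a remote request changes the measured proof time without touching the proof logic, which mirrors the local and S3 variants of each experiment.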

3.3  Implementation 

Our benchmark test is a single-threaded application written in C using the libpdp
library [39], an open-source C library providing implementations for MAC-PDP, A-PDP,
CPOR, and SEPDP. In all experiments, our benchmark application is run on Amazon Elastic
Compute Cloud (EC2). The client, auditor and prover are each run on the same EC2 instance:
a c3.xlarge instance, running 64-bit Ubuntu Server 14.04 LTS using HVM virtualization. In
other environments, these three parties might be separate hosts or owned by separate
organizations (e.g., tagging and ingest performed by the data owner, and auditing performed
by a third party). Given how we have defined the tag, challenge and verify timing
measurements, the properties of the network connecting these parties are irrelevant to our
measurements, so we elect to run these parties on the same host. For each of our schemes,
we conduct two types of benchmarks: one using local data storage and one using remote data
storage. For local storage experiments, M* is stored on the EC2 instance's local storage.
For the remote storage experiments, M* is stored in an Amazon S3 bucket.

Table 3.1: Default benchmark parameters used in our experiments.

Scheme     Parameters
MAC-PDP    $\ell = 460$, $k_{mac}$ = 20 bytes
A-PDP      $\ell = 460$, $N$ = 1024 bits, PRP $k_1$ = 16 bytes, PRF $k_2$ = 20 bytes
CPOR       $\ell = 460$, $k_{enc}$ = 32 bytes, $k_{prf}$ = 20, $k_{mac}$ = 20 bytes,
           $\lambda = 80$, $p$ = 80 bits, sector size = 9 bytes
SEPDP      $\ell = 460$, AE $K$ = 16 bytes, PRP $W, Z$ = 16 bytes, PRF $k_i$ = 20 bytes,
           $t = 1$

Unless otherwise noted, $bs = 4096$ bytes and $fs = 2^{25}$ bytes.

Experiments are run sequentially, each time doubling block size or file size for a
particular scheme. Pre-experiment trials in which the order of experiments was randomized
demonstrated no discernible impact on our results; thus, we believe our trials are
independent of the order of test execution. Each experiment is performed using
pre-generated, random input file data. Every experiment is repeated three times (graphs in
Chapter 4 show raw data from all three iterations). The default parameters used for each
scheme are provided in Table 3.1.
CHAPTER 4:
Analysis 


In this chapter, we analyze the timing data collected for each of the five major PDP
algorithms: KeyGen, Tag, Challenge, Proof and Verify. Each algorithm is analyzed separately
across  all  four  schemes,  including  our  expectations  based  on  each  algorithm,  what  the  data 
actually  show,  and  the  cost  model  we  have  developed  for  each  scheme  and  algorithm. 

For  each  cost  model,  we  employ  the  following  notation: 

bs,  block  size  in  bytes 

fs,  file  size  in  bytes 

ss,  sector  size  in  bytes 

$c_0, c_1, \ldots$, model-specific constants.

For  all  the  schemes,  fs/bs  yields  the  number  of  blocks  in  the  file  M.  In  each  experiment, 
there  is  a  point  where  the  file  size  and  block  size  are  such  that  the  total  number  of  blocks 
falls  below  the  default  number  of  challenges  selected  for  an  audit.  At  this  point,  fewer 
computations  are  performed,  resulting  in  faster  algorithm  times.  Otherwise,  all  schemes 
approach  some  threshold  where  proof  cost  becomes  constant.  All  model-specific  constants 
are  derived  experimentally  using  least-squares  approximation.  Unless  otherwise  noted,  all 
figure  times  are  in  seconds. 
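As an illustration of how model constants can be derived by least squares, the sketch below fits the two constants of a linear model of the form c_0 + c_1 * fs using the closed-form ordinary least-squares solution. The data points are synthetic placeholders rather than measurements from our experiments, and the function name is hypothetical.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = c0 + c1 * x; returns (c0, c1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    return c0, c1


if __name__ == "__main__":
    # Synthetic timings (seconds) for file sizes doubling from 2^20 to 2^25 bytes.
    file_sizes = [2 ** e for e in range(20, 26)]
    timings = [0.5 + 2e-9 * fs for fs in file_sizes]
    c0, c1 = fit_linear(file_sizes, timings)
    print(c0, c1)
```

Models with more than two constants, such as those that include an fs/bs term, need a multi-variable least-squares solver, but the principle is the same: each term of the model becomes one column of the design matrix.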

4.1  Tag  File 

In  our  experiments,  there  is  no  theoretical  difference  between  running  the  Tag  algorithm 
with  local  data  or  using  AWS  S3.  Our  measurements  also  bear  this  out. 

4.1.1  MAC-PDP 

We observe that when block size is held constant and file size increases, the tag time
increases linearly (see Figures 4.1a and 4.2a). When the file size remains constant and
the block size varies, the execution time is nearly constant (see Figures 4.1b and 4.2b).

[Plots not reproduced. Panel (a): file size vs. tag time. Panel (b): block size (bytes) vs.
tag time.]

Figure 4.1: File and block size vs. tag time for local data experiments.

Figure 4.2: File and block size vs. tag time for S3 data experiments.

This is explained in terms of MAC-PDP generating tags via a hash-based MAC on every
file block. Since the hash algorithm generates a digest through repeated operations on
fixed-size blocks, the operation time should be proportional to the size of the input. We
summarize these trends in Model 4.1, which expresses the tag time as proportional to the
file size.

$$c_0 + c_1 \cdot fs \tag{4.1}$$
4.1.2 A-PDP

We observe that when block size is held constant and file size increases, the tag time
increases linearly (see Figures 4.1a and 4.2a). When the file size is held constant and the
block size increases, the tag time decreases linearly (see Figures 4.1b and 4.2b).

This  is  explained  in  terms  of  A-PDP  generating  tags  through  modular  exponentiation  on 
every  block.  As  the  file  size  grows,  there  will  be  more  blocks  to  tag,  resulting  in  increased 
execution  time.  As  block  size  increases,  there  is  a  corresponding  decrease  in  the  number 
of  blocks  to  tag.  We  summarize  these  trends  in  Model  4.2,  which  expresses  the  tag  time  as 
proportional  to  the  file  size  and  inversely  proportional  to  the  block  size. 


$$c_0 + c_1 \cdot fs/bs + c_2 \cdot bs + c_3 \cdot fs \tag{4.2}$$


4.1.3  CPOR 

We observe that when block size is held constant and file size increases, the tag time
increases linearly (see Figures 4.1a and 4.2a). When the file size is held constant and the
block size increases, the tag time remains constant (see Figures 4.1b and 4.2b).

This is explained in terms of CPOR generating tags through nested loops of modular
multiplication and addition. The number of loops is determined by the total number of
sectors. An increase in file size results in a corresponding increase in the number of
sectors. However, since changes in block size have little to no effect on the number of
sectors, the algorithm times remain nearly constant as the block size varies. We summarize
these trends in Model 4.3, which expresses the tag time as proportional to the file size
and inversely proportional to the sector size.

$$c_0 + c_1 \cdot fs + c_2 \cdot fs/ss \tag{4.3}$$


4.1.4  SEPDP 

We observe that when block size is held constant and file size increases, the tag time
increases linearly up to a point, after which the tag time remains constant (see Figures
4.1a and 4.2a). When the file size is held constant and the block size increases, the tag
time increases linearly up to a point, after which the tag time remains constant (see
Figures 4.1b and 4.2b).

This is explained in terms of SEPDP generating tokens by calculating the hash of a
specified number of blocks. The tag time, then, is proportional to the number of bytes being
processed, which is determined by the number of blocks per token and the block size. The
number of blocks per token is defined by the default security parameter $\ell$, unless the
block and file sizes are such that there are fewer blocks than the default parameter, in
which case the token consists of all the blocks in the file. We summarize these trends in
Model 4.4, which expresses the tag time as proportional to the total number of bytes
processed per token.

$$(c_0 + c_1 \cdot \min(\min(fs/bs, \ell) \cdot bs, fs)) \cdot t \tag{4.4}$$

Above, $\min(\min(fs/bs, \ell) \cdot bs, fs)$ is essentially the number of bytes processed.
When $fs/bs < \ell$, the entire file is processed to generate tokens.
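Model 4.4 can be evaluated directly once its constants are known. The sketch below is a plain transcription of the formula; the constants passed in the example are placeholders chosen to expose the byte count, not fitted values, and the function name is hypothetical.

```python
def sepdp_tag_time(fs: int, bs: int, ell: int, t: int, c0: float, c1: float) -> float:
    """Model 4.4: (c0 + c1 * bytes hashed per token) * number of tokens t."""
    blocks_in_file = fs / bs
    bytes_per_token = min(min(blocks_in_file, ell) * bs, fs)
    return (c0 + c1 * bytes_per_token) * t


if __name__ == "__main__":
    # With c0 = 0 and c1 = 1 the model returns the bytes hashed per token.
    print(sepdp_tag_time(2 ** 25, 4096, 460, 1, 0.0, 1.0))  # 460 blocks of 4096 bytes
    print(sepdp_tag_time(2 ** 20, 4096, 460, 1, 0.0, 1.0))  # whole file: 256 blocks
```

The second call shows the plateau described above: once the file has fewer than $\ell$ blocks, every token covers the entire file and the byte count is capped at fs.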


4.2  Generate  Challenge 

In our experiments, there is no theoretical difference between running the Challenge
algorithm with local data or using AWS S3. Our measurements and resultant models also bear
this out.

4.2.1  MAC-PDP 

We  observe  that  when  block  size  is  held  constant  and  file  size  increases,  the  challenge 
time  runs  in  constant  time  up  to  a  point,  after  which  it  runs  in  a  slower  constant  time  (see 
Figures  4.3a  and  4.4a).  When  the  file  size  is  held  constant  and  the  block  size  increases,  the 
challenge  time  runs  in  constant  time  up  to  a  point,  after  which  it  runs  in  a  faster  constant 
time  (see  Figures  4.3b  and  4.4b). 

Figure 4.3: File and block size vs. generate challenge time for local data
experiments. Panels: (a) File size vs. challenge time; (b) Block size vs. challenge time.

Figure 4.4: File and block size vs. generate challenge time for S3 data
experiments.


This is explained in terms of ℓ and the total number of file blocks, given by fs/bs. When
there are fewer total blocks than ℓ, then all indices are used for the challenge. However,
when there are more blocks than ℓ, then the challenge indices must be chosen without
replacement, which still runs in constant time, but takes longer than simply using all available
indices. We summarize these trends in Model 4.5, which expresses the challenge time as
one of two constants.

\[
\begin{cases}
c_0 & \text{if } \lceil fs/bs \rceil < \ell \\
c_1 & \text{if } \lceil fs/bs \rceil > \ell
\end{cases}
\tag{4.5}
\]
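
The two cases can be illustrated with a short Python sketch. This is not the benchmarked implementation: the function name is hypothetical, Python's standard `random.sample` stands in for whatever sampling routine an implementation uses, and the case where the block count equals ℓ is assigned to the first branch as an assumption.

```python
import random


def challenge_indices(num_blocks: int, ell: int, rng: random.Random = None) -> list:
    """Pick the block indices for one challenge.

    If the file has no more than ell blocks, every index is used.
    Otherwise ell distinct indices are drawn without replacement.
    """
    rng = rng or random.Random()
    if num_blocks <= ell:
        return list(range(num_blocks))
    return rng.sample(range(num_blocks), ell)


# Hypothetical values: ell = 460, files of 10 and 1000 blocks.
all_blocks = challenge_indices(10, 460)
sampled = challenge_indices(1000, 460, random.Random(0))
```

The second branch draws a fixed number of indices regardless of the file size, so its cost is constant, yet it does more work than returning the full range.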


4.2.2  A-PDP 

We  observe  that  generate  challenge  runs  in  constant  time  regardless  of  file  or  block  size  (see 
Figures  4.3  and  4.4).  This  is  explained  in  terms  of  the  A-PDP  challenge  being  independent 
of  the  file  or  block  size.  We  summarize  these  trends  in  Model  4.6,  which  expresses  the 
challenge  time  as  constant. 


\[ c_0 \tag{4.6} \]


4.2.3  CPOR 

We observe that when block size is held constant and file size increases, the generate
challenge time increases linearly up to a point, after which it runs in constant time (see
Figures 4.3a and 4.4a). When the file size is held constant and the block size increases, the
challenge time runs in constant time up to a point, after which it decreases linearly (see
Figures 4.3b and 4.4b).

This is explained in terms of CPOR generating a random ℓ-element set for the challenge.
As the file size increases, the size of this set increases, until the number of blocks exceeds
ℓ. Similarly, when the block size increases to the point where there are fewer total blocks
than ℓ, then the size of the challenge set will begin to decrease. We summarize these trends
in Model 4.7, which expresses the challenge time as either constant or proportional to the
total number of blocks.


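\[
\begin{cases}
c_1 + c_2 \cdot fs/bs & \text{if } \lceil fs/bs \rceil < \ell \\
c_0 & \text{if } \lceil fs/bs \rceil > \ell
\end{cases}
\tag{4.7}
\]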

4.2.4 SEPDP

We observe that when block size is held constant and file size increases, generate challenge
runs in constant time (see Figures 4.3a and 4.4a). When the file size is held constant and the
block size increases, generate challenge runs in constant time up to a point, after which the
run time is almost twice as long (see Figures 4.3b and 4.4b).

The former trend is explained in terms of SEPDP recomputing k_i and c_i for the i-th
challenge, neither of which is affected by the file size. We are unable to explain the latter trend.
Nothing in the algorithm design suggests that block size should affect the run time, and we
believe that the anomaly is an artifact of implementation, not a feature of the scheme. We
summarize these trends in Model 4.8, which expresses the challenge time as constant.


\[ c_0 \tag{4.8} \]


4.3  Generate  Proof 

In our experiments, there is a noticeable difference between the timing of the Proof algorithm
using local data storage and its timing using remote data storage on AWS S3. We analyze
these two sets of experiments separately.

For experiments interacting with S3, we observe that when block size is held constant and
file size increases, the proof time increases linearly up to the point where the number of
blocks exceeds ℓ, after which the proof time is constant (see Figure 4.7a). When the file
size is held constant and the block size increases, the proof time is nearly constant up to the
point where ℓ exceeds the number of blocks, after which the proof time decreases linearly
(see Figure 4.7b).

This is explained in terms of each GET from S3 taking significantly more time than
generating the proof itself (see Figure 4.6). Thus, the number of GETs dominates the trend. For
MAC-PDP, A-PDP, and CPOR there is one GET for each challenged block and one GET
for each corresponding tag (see Figure 4.5). This is summarized in Equation 4.9, which
expresses the number of GETs as twice the total number of blocks or twice ℓ, whichever is
less.

\[ 2 \cdot \min(fs/bs, \ell) \tag{4.9} \]


For SEPDP, there is one GET for each challenged block, but only one GET for the token
corresponding to the i-th challenge (see Figure 4.5). This is summarized in Equation 4.10,
which expresses the number of GETs as one more than the total number of blocks or one
more than ℓ, whichever is less.


\[ \min(fs/bs, \ell) + 1 \tag{4.10} \]
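
Equations 4.9 and 4.10 translate directly into a few lines of Python. The sketch below is illustrative only: the function name and the scheme labels are hypothetical, the block count is rounded up to an integer as an assumption, and ℓ = 460 in the usage lines is an example value rather than the parameter used in the experiments.

```python
def num_gets(fs: int, bs: int, ell: int, scheme: str) -> int:
    """Number of S3 GETs for one proof, following Equations 4.9 and 4.10."""
    num_blocks = -(-fs // bs)              # integer ceiling of fs / bs (assumption)
    challenged = min(num_blocks, ell)
    if scheme == "SEPDP":
        return challenged + 1              # challenged blocks plus a single token
    if scheme in ("MAC-PDP", "A-PDP", "CPOR"):
        return 2 * challenged              # one block and one tag per challenge index
    raise ValueError(f"unknown scheme: {scheme}")


# A 1000-block file with a hypothetical ell of 460.
gets_cpor = num_gets(4096 * 1000, 4096, 460, "CPOR")
gets_sepdp = num_gets(4096 * 1000, 4096, 460, "SEPDP")
```

For files with at least ℓ blocks, the SEPDP count is ℓ + 1 against 2ℓ for the other schemes, a ratio that approaches one half as ℓ grows.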


4.3.1  MAC-PDP 

For local data experiments, we observe that when block size is held constant and file size
increases, the proof time increases linearly up to the point where the number of blocks
exceeds ℓ, after which the proof time is nearly constant, increasing slightly as the file size
grows (see Figure 4.6a). When the file size is held constant and the block size increases,
the proof time increases linearly up to the point where ℓ exceeds the number of blocks,
after which the proof time is constant (see Figure 4.6b).

This is explained in terms of MAC-PDP generating a proof containing a message block
and hash for each index in the challenge. The proof time therefore depends on the total number of
bytes hashed. We summarize these trends in Model 4.11, which expresses the proof time
as proportional to the total number of blocks, file size, and block size, or proportional to
just the block size.


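\[
\begin{cases}
c_0 + c_1 \cdot fs/bs + c_2 \cdot bs + c_3 \cdot fs & \text{if } \lceil fs/bs \rceil < \ell \\
c_4 + c_5 \cdot bs & \text{if } \lceil fs/bs \rceil > \ell
\end{cases}
\tag{4.11}
\]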

Figure 4.5: File and block size vs. number of GETs from S3. Panels: (a) File size vs.
GETs; (b) Block size vs. GETs.

Figure 4.6: File and block size vs. generate proof time for local data experiments.


4.3.2  A-PDP 

For local data experiments, we observe that when block size is held constant and file size
increases, the proof time increases linearly up to the point where the number of blocks
exceeds ℓ, after which the proof time remains constant (see Figure 4.6a). When the file
size is held constant and the block size increases, the proof time increases linearly (see
Figure 4.6b).

Figure 4.7: File and block size vs. generate proof time for S3 data experiments.
Panels: (a) File size vs. proof time; (b) Block size vs. proof time.

This is explained in terms of A-PDP generating proofs through modular exponentiation
of ℓ message blocks. Thus the proof time will depend on the total number of challenge
blocks as well as the size of each block. We summarize these trends in Model 4.12, which
expresses the proof time as proportional to the number of blocks, file size, and block size
or proportional to just the block size.


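\[
\begin{cases}
c_2 + c_3 \cdot fs/bs + c_4 \cdot bs + c_5 \cdot fs & \text{if } \lceil fs/bs \rceil < \ell \\
c_0 + c_1 \cdot bs & \text{if } \lceil fs/bs \rceil > \ell
\end{cases}
\tag{4.12}
\]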


4.3.3  CPOR 

For local data experiments, we observe that when block size is held constant and file size
increases, the proof time increases linearly up to the point where the number of blocks
exceeds ℓ, after which the proof time remains constant (see Figure 4.6a). When the file size
is held constant and the block size increases, the proof time increases linearly up to the
point where ℓ exceeds the number of blocks, after which the proof time remains constant
(see Figure 4.6b).

This is explained in terms of CPOR generating the proof by computing μ_j and σ for each
of the indices in the challenge set. Additionally, μ_j includes modular multiplication of all
the sectors of each challenge block. Therefore, the proof time increases with the indices in
the challenge set, as well as when the block size increases. We summarize these trends in
Model 4.13, which expresses the proof time as proportional to the number of blocks, file
size, and block size, or proportional to just the block size.

\[
\begin{cases}
c_0 + c_1 \cdot fs/bs + c_2 \cdot bs + c_3 \cdot fs & \text{if } \lceil fs/bs \rceil < \ell \\
c_4 + c_5 \cdot bs & \text{if } \lceil fs/bs \rceil > \ell
\end{cases}
\tag{4.13}
\]


4.3.4  SEPDP 

For local data experiments, we observe that when block size is held constant and file size
increases, the proof time increases linearly up to the point where the number of blocks
exceeds ℓ, after which the proof time remains constant (see Figure 4.6a). When the file size
is held constant and the block size increases, the proof time increases linearly up to the
point where ℓ exceeds the number of blocks, after which the proof time remains constant
(see Figure 4.6b).

This  is  explained  in  terms  of  SEPDP  generating  the  proof  by  computing  the  hash  of  all 
the  message  blocks  for  a  particular  token.  The  proof  time  is  proportional,  then,  to  the  total 
number  of  bytes  being  hashed,  given  by  the  number  of  challenge  blocks  and  block  size.  We 
summarize  these  trends  in  Model  4.14,  which  expresses  the  proof  time  as  proportional  to 
the  total  number  of  blocks,  block  size,  and  file  size,  or  proportional  to  just  the  block  size. 


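\[
\begin{cases}
c_0 + c_1 \cdot fs/bs + c_2 \cdot bs + c_3 \cdot fs & \text{if } \lceil fs/bs \rceil < \ell \\
c_4 + c_5 \cdot bs & \text{if } \lceil fs/bs \rceil > \ell
\end{cases}
\tag{4.14}
\]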
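
Models 4.11, 4.13, and 4.14 share one piecewise form (Model 4.12 has the same shape with its coefficients indexed differently), so a single helper can evaluate any of them. The Python sketch below is illustrative: the function name is hypothetical, the coefficients in the usage lines are placeholders rather than fitted values, and the boundary case where the block count equals ℓ is assigned to the second branch as an assumption, since the models do not specify it.

```python
def local_proof_time(fs: int, bs: int, ell: int, c: tuple) -> float:
    """Evaluate the piecewise local proof-time model.

    c = (c0, c1, c2, c3, c4, c5), indexed as in Models 4.11, 4.13, and 4.14.
    """
    c0, c1, c2, c3, c4, c5 = c
    num_blocks = -(-fs // bs)              # integer ceiling of fs / bs
    if num_blocks < ell:
        # Fewer blocks than ell: every block is challenged.
        return c0 + c1 * (fs / bs) + c2 * bs + c3 * fs
    # At least ell blocks: a fixed number of blocks is challenged.
    return c4 + c5 * bs


# Placeholder coefficients, chosen only to make the arithmetic easy to follow.
coeffs = (1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
below = local_proof_time(fs=8, bs=4, ell=10, c=coeffs)    # 2 blocks, first branch
above = local_proof_time(fs=100, bs=4, ell=10, c=coeffs)  # 25 blocks, second branch
```

In the second branch the file size no longer appears, which matches the plateau observed once the number of blocks exceeds ℓ.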

4.4 Verify Proof

In  our  experiments,  there  is  no  theoretical  difference  between  running  the  Verify  algorithm 
with  local  data  or  using  AWS  S3.  Our  measurements  and  resultant  models  also  bear  this 
out. 

4.4.1  MAC-PDP 

We observe that when block size is held constant and file size increases, the verify time
increases linearly up to the point where the total number of blocks exceeds ℓ, after
which it remains constant (see Figures 4.8a and 4.9a). When the file size is held constant
and the block size increases, the verify time increases linearly up to the point where ℓ
exceeds the total number of blocks, after which it remains constant (see Figures 4.8b and
4.9b).

This is explained in terms of MAC-PDP verifying a proof by hashing the message block for each
index in the challenge. Therefore, the verify time is dependent on the total number of bytes hashed. We
summarize these trends in Model 4.15, which expresses the verify time as proportional to
the file size or proportional to the block size.


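\[
\begin{cases}
c_0 + c_1 \cdot fs & \text{if } \lceil fs/bs \rceil < \ell \\
c_2 + c_3 \cdot bs & \text{if } \lceil fs/bs \rceil > \ell
\end{cases}
\tag{4.15}
\]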


4.4.2  A-PDP 

We observe that when block size is held constant and file size increases, the verify time
increases linearly up to the point where the total number of blocks exceeds ℓ, after which
it runs in constant time (see Figures 4.8a and 4.9a). When the file size is held constant and
the block size increases, the verify time remains constant up to the point where ℓ exceeds
the total number of blocks, after which it decreases linearly (see Figures 4.8b and 4.9b).

Figure 4.8: File and block size vs. verify proof time for local data experiments.
Panel: (a) File size vs. verify time.

Figure 4.9: File and block size vs. verify proof time for S3 data experiments.
Panel: (a) File size vs. verify time.

This is explained in terms of A-PDP verifying proofs by generating τ and comparing the
hash of τ with ρ. Since τ is computed by generating ℓ hashes, the algorithm time will
be proportional to the total number of blocks that were challenged. We summarize these
trends in Model 4.16, which expresses the verify time as constant or proportional to the
total number of blocks.


\[
\begin{cases}
c_1 + c_2 \cdot fs/bs & \text{if } \lceil fs/bs \rceil < \ell \\
c_0 & \text{if } \lceil fs/bs \rceil > \ell
\end{cases}
\tag{4.16}
\]

4.4.3 CPOR

We observe that when block size is held constant and file size increases, the verify time
increases linearly up to the point where the total number of blocks exceeds ℓ, after which
it runs in constant time (see Figures 4.8a and 4.9a). When the file size is held constant and
the block size increases, the verify time increases linearly (see Figures 4.8b and 4.9b).

This is explained in terms of CPOR verifying the proof by summing α_j μ_j for all sectors
of each block being challenged. As the file size grows, the number of sectors for each
challenge increases. As the block size grows, the number of sectors per block increases.
We summarize these trends in Model 4.17, which expresses the verify time as proportional
to the number of blocks, file size, and block size, or proportional to just the block size.


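\[
\begin{cases}
c_0 + c_1 \cdot fs/bs + c_2 \cdot bs + c_3 \cdot fs & \text{if } \lceil fs/bs \rceil < \ell \\
c_4 + c_5 \cdot bs & \text{if } \lceil fs/bs \rceil > \ell
\end{cases}
\tag{4.17}
\]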


4.4.4  SEPDP 

We observe that when block size is held constant and file size increases, the verify time
remains constant (see Figures 4.8a and 4.9a). When the file size is held constant and the
block size increases, the verify time remains constant up to a point, after which the verify
time is about twice as long (see Figures 4.8b and 4.9b).

This is explained in terms of the SEPDP verify algorithm decrypting σ_i and comparing it
with the proof. The decryption time should not be dependent on file size. Additionally, the
decryption time should not be dependent on block size, and we believe that the anomaly is
an artifact of implementation, not a feature of the scheme. We summarize these trends in
Model 4.18, which expresses the verify time as a constant.


\[ c_0 \tag{4.18} \]

4.5 Total Cost

We  break  costs  down  into  three  basic  categories  for  analysis:  (1)  the  cost  to  tag,  which 
includes  the  computational  costs  to  compute  the  tag  and  the  PUT  costs  of  uploading  the 
tag;  (2)  the  cost  to  store  the  tag;  (3)  the  audit  cost,  which  includes  the  computational  cost  to 
challenge,  prove,  and  verify,  and  the  GET  costs  associated  with  retrieving  file  blocks  and 
tags  during  those  operations. 

SEPDP is not depicted on the cost graphs because its use of audit tokens does not compare
well with the other schemes. Whereas MAC-PDP, A-PDP, and CPOR all support an unlimited
number of audits once the file is tagged, the number of audits for SEPDP is chosen in
advance. Thus a total cost graph for SEPDP will depend on the desired frequency of audits
before a file needs to be retagged.

We note that the costs in our results should be thought of as minimal costs. We have
ignored auditor costs associated with waiting for a response from the prover, as well as
wake-up costs for the prover when it receives a proof request, which we do not measure
as part of our experiments (see Figure 3.2). Measuring these costs would reflect network
latency and implementation-specific details we do not believe to be strongly related to PDP.
Also, in a scaled implementation of PDP, where multiple audits are performed for clients
simultaneously, the downtime costs may not be consequential. Thus the basis costs we
depict do not reflect actual costs, but can accurately reflect cost comparisons among the
schemes.

We chose to implement our benchmark tests on Amazon Web Services (AWS); however,
there are several alternatives with comparable pricing schemes and storage options. For
example, Microsoft Azure Blob storage, Google Cloud Storage, and Rackspace Cloud Files
all have storage services and pricing schemes similar to Amazon's. The AWS S3 storage
pricing scheme is shown in Table 4.1.¹


¹Prices were obtained from https://aws.amazon.com/s3/pricing as of March 2016.

Table 4.1: Amazon Web Services S3 standard storage pricing scheme.

Storage tier             Cost/GB
First 1 TB / month       $0.0300
Next 49 TB / month       $0.0295
Next 450 TB / month      $0.0290
Next 500 TB / month      $0.0285
Next 4000 TB / month     $0.0280
Over 5000 TB / month     $0.0275
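
To make the tiered pricing concrete, the following Python sketch computes a monthly storage charge from the rates in Table 4.1. It is an illustration under stated assumptions: the function and constant names are hypothetical, 1 TB is taken to be 1024 GB, and the stored object is assumed to be billed from the first tier.

```python
# (tier size in TB, USD per GB per month), from Table 4.1; None marks the open-ended tier.
S3_TIERS = [
    (1, 0.0300),
    (49, 0.0295),
    (450, 0.0290),
    (500, 0.0285),
    (4000, 0.0280),
    (None, 0.0275),
]
GB_PER_TB = 1024  # assumption: binary terabytes


def monthly_storage_cost(size_gb: float) -> float:
    """Charge each tier's rate on the portion of the data that falls in that tier."""
    remaining = size_gb
    cost = 0.0
    for tier_tb, usd_per_gb in S3_TIERS:
        in_tier = remaining if tier_tb is None else min(remaining, tier_tb * GB_PER_TB)
        cost += in_tier * usd_per_gb
        remaining -= in_tier
        if remaining <= 0:
            break
    return cost


# 2 TB: the first 1024 GB at $0.0300 and the next 1024 GB at $0.0295.
two_tb = monthly_storage_cost(2 * GB_PER_TB)
```

Because the per-GB rate falls by only half a tenth of a cent between tiers, the charge is close to linear in the amount stored.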

Table 4.2: Comparison of cloud providers' remote storage limitations.

Provider                Max object size   Max PUT size   Max metadata size
Amazon S3               5 TB              5 GB           2 KB
Microsoft Azure         195 GB            64 MB          8 KB
Google Cloud Storage    5 TB              5 TB           unspecified
Rackspace               5 GB              5 GB           4 KB


4.5.1  Tag  Costs 

Tag  costs  consist  of  the  cost  to  generate  the  tag  and  the  PUT  costs  associated  with  uploading 
the  file  to  storage  (see  Figure  4.10).  These  costs  resemble  the  trends  we  observed  for 
computational costs associated with generating a tag (see Figure 4.1a), with A-PDP being
the most expensive, followed by CPOR and then MAC-PDP. The approximate basis costs to tag
a  file  range  from  a  fraction  of  a  cent  to  $3  for  a  1  GB  file;  $0.13  to  $20  for  a  1  TB  file;  and 
$135  to  $20,400  for  a  1  PB  file. 


Figure  4.10:  Cost  to  tag,  based  on  tag  algorithms  and  AWS  EC2  pricing 

4.5.2 Storage Costs

We calculate the storage cost for each scheme (see Figure 4.12) based on its corresponding
tag size (see Table 4.3). As the file size increases, the tag file overhead increases
linearly for MAC-PDP, A-PDP, and CPOR, but remains constant for SEPDP; however, as
the block size increases, the tag file overhead decreases linearly for MAC-PDP, A-PDP, and
CPOR, but increases linearly for SEPDP (see Figure 4.11). Since A-PDP has the largest
tag size, it has the highest storage cost. MAC-PDP and CPOR have almost the same tag
size and, therefore, very similar storage costs.


Figure  4.11:  File  and  block  size  vs.  tag  file  overhead. 


Table 4.3: Tag file overhead and tag size for each scheme (bs = 4096 bytes).

Scheme      Total tag file overhead (% fs)   Tag size (bytes)
A-PDP       4.864%                           204
MAC-PDP     0.477%                           20
CPOR        0.429%                           18

We investigated the option of storing tags as metadata to reduce cost; however, all the
storage providers we reviewed included metadata as part of the overall file size. Additionally,
at the time of publication, AWS S3 limits metadata storage to 2 KB. The maximum file sizes
at which the tags can be stored as metadata on AWS S3 are shown in Table 4.4.

Figure 4.12: Cost to store tag, based on scheme tag overhead and AWS S3
pricing.


Table 4.4: Maximum file sizes at which tags can be stored as metadata on
AWS S3.

Scheme      File size
MAC-PDP     428 KB
A-PDP       41 KB
CPOR        476 KB
SEPDP       0 KB


4.5.3  Audit  Costs 

We  calculate  the  total  audit  cost  (see  Figure  4.13)  by  determining  the  number  of  GETs 
and  computational  cost  to  generate  a  challenge,  generate  a  proof,  and  verify  the  proof. 
Since  the  proof  time  is  significantly  larger  than  the  challenge  or  verify  times  (compare 
Figure  4.7  with  Figures  4.4  and  4.9),  we  are  not  surprised  to  find  the  proof  time  dictates  the 
audit  cost  trends.  Additionally,  the  differences  in  proof  times  observable  in  the  local  data 
experiments  (see  Figure  4.6)  nearly  disappear  in  the  S3  experiments  due  to  the  relatively 
larger  times  required  to  communicate  with  S3  and  transfer  proof  data.  As  a  consequence 
of the communication time common to all schemes, the audit costs are nearly identical for
MAC-PDP, A-PDP, and CPOR.


It  is  worth  noting  that  the  audit  cost  for  SEPDP  is  approximately  half  that  of  the  three  other 
schemes.  The  SEPDP  proof  scheme  has  fewer  GETs  than  the  other  schemes  since  it  only 
retrieves  a  single  tag  file  in  each  audit,  instead  of  a  tag  per  challenge  block,  as  in  the  other 
schemes. 


Figure  4.13:  Cost  to  audit,  based  on  audit  cost  models  and  AWS  EC2  and 
S3  pricing 


4.5.4  Combined  Cost  Scenarios 

We  observe  that  the  monthly  cost  to  store  and  audit  once  per  hour  is  nearly  identical  for  all 
schemes  until  the  storage  costs  begin  to  dominate  at  larger  file  sizes,  after  which  A-PDP 
becomes  much  more  expensive  than  MAC-PDP  and  CPOR  (see  Figure  4.15). 

Since the audit costs are nearly identical for all three schemes, the tag and storage costs
have the most significant impact on the total cost of each scheme. Figures 4.14a and 4.14b
show the up-front cost to tag and the cumulative cost of storing and auditing a 1 GB and a 1 TB
file, respectively, at one audit per hour each month. For the 1 GB file, the tag and storage
costs are less significant and the slightly higher audit cost of MAC-PDP can be observed
at one year of audits; however, the high tag and storage costs of the 1 TB file dominate,
resulting in a higher cost for the A-PDP scheme. The following are approximate basis costs
incorporating the up-front cost to tag and the cumulative cost of storing and auditing at one
audit per hour for one year: $160 to $175 for a 1 GB file; $170 to $230 for a 1 TB file; and
$2,000 to $38,700 for a 1 PB file.


(a)  Tag,  storage,  and  audit  costs  for  1  GB  file  (b)  Tag,  storage,  and  audit  costs  for  1  TB  file 

Figure  4.14:  Cumulative  tag,  storage,  and  audit  costs  for  one  audit  per  hour. 


(a)  File  size  vs.  storage  and  audit  costs  (b)  File  size  vs.  storage  and  audit  costs 

Figure  4.15:  File  size  vs.  storage  and  audit  costs  for  files  at  one  audit  per 
hour  for  one  month. 

CHAPTER 5:
Conclusion


We have developed generic cost models for four PDP schemes, which can be used to
infer future cost. Additionally, we have shown that audit costs of some sophisticated PDP
schemes (A-PDP, CPOR) are nearly identical to those of the simple MAC-PDP scheme,
whereas tag and storage costs have a significant impact on total cost differences among
the schemes. We conclude that the total costs of MAC-PDP and CPOR are comparable,
whereas the cost of A-PDP becomes expensive relative to the other schemes at large file
sizes. Our preliminary experimentation shows that the audit cost for SEPDP is about half
that of the other schemes; however, the scheme is limited to a finite number of audits.

From  cost  projections  based  on  generic  models  for  MAC-PDP,  A-PDP,  and  CPOR,  we  find 
the  basis  cost  for  tagging  is  less  than  $1  for  a  1  GB  file;  $0.13  to  $20  for  a  1  TB  file;  and 
$135  to  $20,400  for  a  1  PB  file.  The  monthly  basis  cost  for  storage  is  a  fraction  of  a  cent 
for  a  1  GB  file;  $0.13  to  $1.50  for  a  1  TB  file;  and  $130  to  $1,500  for  a  1  PB  file.  The 
cost for a single audit is approximately $0.02 for files larger than 2 MB. Combined cost
projections incorporating the up-front cost to tag and the cumulative cost of storing and
auditing at one audit per hour for one year show basis costs of $160 to $175 for a 1 GB file;
$170 to $230 for a 1 TB file; and $2,000 to $38,700 for a 1 PB file.
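
The yearly figures for small files follow largely from the per-audit cost. As a rough check, the Python sketch below multiplies the approximate $0.02 per audit quoted above by the number of hourly audits in a year; the function name and the 365-day year are assumptions, and tag and storage costs are deliberately left out.

```python
def cumulative_audit_cost(cost_per_audit: float,
                          audits_per_hour: int = 1,
                          days: int = 365) -> float:
    """Total audit cost over a period, ignoring tag and storage costs."""
    return cost_per_audit * audits_per_hour * 24 * days


# One audit per hour for one year at roughly $0.02 per audit: 8,760 audits.
yearly_audits = 24 * 365
yearly_cost = cumulative_audit_cost(0.02)
```

The result is about $175, which sits at the upper end of the $160 to $175 range reported for a 1 GB file, where tag and storage costs amount to little.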

5.1  Future  Work 

While our benchmark tests covered a limited number and type of PDP implementations,
future studies could compare schemes that incorporate erasure codes, dynamic data, or
distributed file system storage, among other variants. Our experiments ignored costs
associated with transfer time and service latency, focusing instead on computational costs.
Follow-on work could separate the client, auditor, and prover in order to measure the
communication costs between each entity. Lastly, follow-on work could compare costs
under different security parameters. In our experiments, we selected security parameters
designed to normalize comparison in terms of the strength of audit (as defined in
Chapter 2). Future work could select parameters to facilitate scheme comparison in terms
of other properties, such as strength of security and efficiency of recovery.